This report explores a dataset containing amounts, APR, and personal financial information for approximately 85,000 loans over 5 years.
## [1] 113937 81
The data contains eighty one variables. I hope to determine what other factors besides Prosper Rating may indicate whether or not a borrower will default or miss payments on a loan. Prosper Rating was not added to the data until 2009 so for this analysis I will take a subset of the data to include only records where the Propser Rating exists and loans that have not been cancelled. I am intentionally displaying data manipulations here per the submission guidelines.
# Intentionally display this section in the report so data manipulations can be
# observed.
# subset the Prosper Data to capture data of interest and exclude cancelled
#loans
spd <- subset(pd, LoanStatus != "Cancelled" & !is.na(ProsperRating..numeric.),
select = c(Term, LoanStatus, BorrowerAPR, ProsperRating..numeric.,
ProsperRating..Alpha., ListingCategory..numeric.,
Occupation, IsBorrowerHomeowner, CreditScoreRangeLower,
CreditScoreRangeUpper, CurrentCreditLines,
CurrentDelinquencies, DebtToIncomeRatio, IncomeRange,
LoanOriginalAmount, LoanOriginationDate))
dim(spd)
## [1] 84853 16
summary(spd)
## Term LoanStatus BorrowerAPR
## Min. :12.00 Current :56576 Min. :0.04583
## 1st Qu.:36.00 Completed :19664 1st Qu.:0.16328
## Median :36.00 Chargedoff : 5336 Median :0.21945
## Mean :42.49 Defaulted : 1005 Mean :0.22666
## 3rd Qu.:60.00 Past Due (1-15 days) : 806 3rd Qu.:0.29254
## Max. :60.00 Past Due (31-60 days): 363 Max. :0.42395
## (Other) : 1103
## ProsperRating..numeric. ProsperRating..Alpha. ListingCategory..numeric.
## Min. :1.000 C :18345 Min. : 0.000
## 1st Qu.:3.000 B :15581 1st Qu.: 1.000
## Median :4.000 A :14551 Median : 1.000
## Mean :4.072 D :14274 Mean : 3.313
## 3rd Qu.:5.000 E : 9795 3rd Qu.: 3.000
## Max. :7.000 HR : 6935 Max. :20.000
## (Other): 5372
## Occupation IsBorrowerHomeowner CreditScoreRangeLower
## Other :21317 False:40005 Min. :600.0
## Professional :10542 True :44848 1st Qu.:660.0
## Executive : 3468 Median :700.0
## Computer Programmer: 3236 Mean :699.4
## Teacher : 2888 3rd Qu.:720.0
## Analyst : 2735 Max. :880.0
## (Other) :40667
## CreditScoreRangeUpper CurrentCreditLines CurrentDelinquencies
## Min. :619.0 Min. : 0.00 Min. : 0.0000
## 1st Qu.:679.0 1st Qu.: 7.00 1st Qu.: 0.0000
## Median :719.0 Median :10.00 Median : 0.0000
## Mean :718.4 Mean :10.51 Mean : 0.3225
## 3rd Qu.:739.0 3rd Qu.:13.00 3rd Qu.: 0.0000
## Max. :899.0 Max. :59.00 Max. :51.0000
##
## DebtToIncomeRatio IncomeRange LoanOriginalAmount
## Min. : 0.000 $50,000-74,999:25627 Min. : 1000
## 1st Qu.: 0.150 $25,000-49,999:24175 1st Qu.: 4000
## Median : 0.220 $100,000+ :15205 Median : 7500
## Mean : 0.259 $75,000-99,999:14498 Mean : 9083
## 3rd Qu.: 0.320 $1-24,999 : 4654 3rd Qu.:13500
## Max. :10.010 Not employed : 649 Max. :35000
## NA's :7296 (Other) : 45
## LoanOriginationDate
## 2014-01-22 00:00:00: 491
## 2013-11-13 00:00:00: 490
## 2014-02-19 00:00:00: 439
## 2013-10-16 00:00:00: 434
## 2014-01-28 00:00:00: 339
## 2013-09-24 00:00:00: 316
## (Other) :82344
# Convert Loan Origination Date from factor to more usable date
spd$LoanDate <- as.Date(ymd_hms(spd$LoanOriginationDate))
# Rearrange the order of the Loan status rather than leaving it in alphabetical
# order
spd$LoanStatus <- ordered(spd$LoanStatus, levels = c("Completed",
"FinalPaymentInProgress",
"Current",
"Past Due (1-15 days)",
"Past Due (16-30 days)",
"Past Due (31-60 days)",
"Past Due (61-90 days)",
"Past Due (91-120 days)",
"Past Due (>120 days)",
"Chargedoff",
"Defaulted" ))
# Add an numeric field and set the value to a corresponding LoanStatus value,
# 0 is the best and 8 is the worst
spd$StatusCode <- NA
spd <- within(spd, {
StatusCode[LoanStatus == "Completed" |
LoanStatus == "FinalPaymentInProgress"] <- 0
StatusCode[LoanStatus == "Current"] <- 1
StatusCode[LoanStatus == "Past Due (1-15 days)"] <- 2
StatusCode[LoanStatus == "Past Due (16-30 days)"] <- 3
StatusCode[LoanStatus == "Past Due (31-60 days)"] <- 4
StatusCode[LoanStatus == "Past Due (61-90 days)"] <- 5
StatusCode[LoanStatus == "Past Due (91-120 days)"] <- 6
StatusCode[LoanStatus == "Past Due (>120 days)"] <- 7
StatusCode[LoanStatus == "Chargedoff" | LoanStatus == "Defaulted"] <- 8
}
)
# Rearrange order of income range so it makes sense when displayed graphically
spd$IncomeRange <- ordered(spd$IncomeRange, levels = c("Not employed",
"Not displayed", "$0",
"$1-24,999",
"$25,000-49,999",
"$50,000-74,999",
"$75,000-99,999",
"$100,000+"))
# Rearrange order of Prosper rating since AA is better than A
spd$ProsperRating..Alpha. <- ordered(spd$ProsperRating..Alpha.,
levels = c("AA","A", "B", "C", "D",
"E", "HR"))
# If the range of Credit scores exists calculate the mean of the Lower and
# Upper Credit score for each borrower, otherwise set it to NA.
spd$CreditScoreMean <- ifelse(is.na(spd$CreditScoreRangeLower), NA,
(spd$CreditScoreRangeLower +
spd$CreditScoreRangeUpper)/2)
# For the rest of our analysis we only want to consider records with a
# CreditScoreMean greater than 9.5 since during Univariate analysis
# we determined these records did not make any sense.
spd <- subset(spd, CreditScoreMean > 9.5)
# Let's just select data with status code of either default or complete
# and save that in a new dataframe for faster processing
spd.sts <- spd %>%
filter((StatusCode == 0 | StatusCode == 8))
# and lets change it to a factor variable for displaying nicely on graphs
spd.sts$StatusCode <- factor(spd.sts$StatusCode, order = TRUE, levels = c(0, 8),
labels = c("Closed", "Defaulted"))
Two additional variables were added in the above code to give 19 variables.
## [1] 84853 19
## 'data.frame': 84853 obs. of 19 variables:
## $ Term : int 36 36 36 60 36 36 36 36 60 36 ...
## $ LoanStatus : Ord.factor w/ 11 levels "Completed"<"FinalPaymentInProgress"<..: 3 3 3 3 3 3 3 3 3 4 ...
## $ BorrowerAPR : num 0.12 0.125 0.246 0.154 0.31 ...
## $ ProsperRating..numeric. : int 6 6 3 5 2 4 7 7 4 5 ...
## $ ProsperRating..Alpha. : Ord.factor w/ 7 levels "AA"<"A"<"B"<"C"<..: 2 2 5 3 6 4 1 1 4 3 ...
## $ ListingCategory..numeric.: int 2 16 2 1 1 2 7 7 1 1 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 43 52 21 43 50 29 24 24 22 50 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 2 1 1 ...
## $ CreditScoreRangeLower : int 680 800 680 740 680 700 820 820 640 680 ...
## $ CreditScoreRangeUpper : int 699 819 699 759 699 719 839 839 659 699 ...
## $ CurrentCreditLines : int 14 5 19 21 10 6 17 17 2 9 ...
## $ CurrentDelinquencies : int 0 4 0 0 0 0 0 0 1 0 ...
## $ DebtToIncomeRatio : num 0.18 0.15 0.26 0.36 0.27 0.24 0.25 0.25 0.12 0.18 ...
## $ IncomeRange : Ord.factor w/ 8 levels "Not employed"<..: 6 5 8 8 5 5 5 5 7 5 ...
## $ LoanOriginalAmount : int 10000 10000 15000 15000 3000 10000 10000 10000 13500 4000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1866 1535 1757 1821 1649 1666 1813 1813 1419 1829 ...
## $ LoanDate : Date, format: "2014-03-03" "2012-11-01" ...
## $ StatusCode : num 1 1 1 1 1 1 1 1 1 2 ...
## $ CreditScoreMean : num 690 810 690 750 690 ...
## Term LoanStatus BorrowerAPR
## Min. :12.00 Current :56576 Min. :0.04583
## 1st Qu.:36.00 Completed :19664 1st Qu.:0.16328
## Median :36.00 Chargedoff : 5336 Median :0.21945
## Mean :42.49 Defaulted : 1005 Mean :0.22666
## 3rd Qu.:60.00 Past Due (1-15 days) : 806 3rd Qu.:0.29254
## Max. :60.00 Past Due (31-60 days): 363 Max. :0.42395
## (Other) : 1103
## ProsperRating..numeric. ProsperRating..Alpha. ListingCategory..numeric.
## Min. :1.000 AA: 5372 Min. : 0.000
## 1st Qu.:3.000 A :14551 1st Qu.: 1.000
## Median :4.000 B :15581 Median : 1.000
## Mean :4.072 C :18345 Mean : 3.313
## 3rd Qu.:5.000 D :14274 3rd Qu.: 3.000
## Max. :7.000 E : 9795 Max. :20.000
## HR: 6935
## Occupation IsBorrowerHomeowner CreditScoreRangeLower
## Other :21317 False:40005 Min. :600.0
## Professional :10542 True :44848 1st Qu.:660.0
## Executive : 3468 Median :700.0
## Computer Programmer: 3236 Mean :699.4
## Teacher : 2888 3rd Qu.:720.0
## Analyst : 2735 Max. :880.0
## (Other) :40667
## CreditScoreRangeUpper CurrentCreditLines CurrentDelinquencies
## Min. :619.0 Min. : 0.00 Min. : 0.0000
## 1st Qu.:679.0 1st Qu.: 7.00 1st Qu.: 0.0000
## Median :719.0 Median :10.00 Median : 0.0000
## Mean :718.4 Mean :10.51 Mean : 0.3225
## 3rd Qu.:739.0 3rd Qu.:13.00 3rd Qu.: 0.0000
## Max. :899.0 Max. :59.00 Max. :51.0000
##
## DebtToIncomeRatio IncomeRange LoanOriginalAmount
## Min. : 0.000 $50,000-74,999:25627 Min. : 1000
## 1st Qu.: 0.150 $25,000-49,999:24175 1st Qu.: 4000
## Median : 0.220 $100,000+ :15205 Median : 7500
## Mean : 0.259 $75,000-99,999:14498 Mean : 9083
## 3rd Qu.: 0.320 $1-24,999 : 4654 3rd Qu.:13500
## Max. :10.010 Not employed : 649 Max. :35000
## NA's :7296 (Other) : 45
## LoanOriginationDate LoanDate StatusCode
## 2014-01-22 00:00:00: 491 Min. :2009-07-20 Min. :0.000
## 2013-11-13 00:00:00: 490 1st Qu.:2012-02-23 1st Qu.:1.000
## 2014-02-19 00:00:00: 439 Median :2013-04-09 Median :1.000
## 2013-10-16 00:00:00: 434 Mean :2012-11-15 Mean :1.351
## 2014-01-28 00:00:00: 339 3rd Qu.:2013-11-05 3rd Qu.:1.000
## 2013-09-24 00:00:00: 316 Max. :2014-03-12 Max. :8.000
## (Other) :82344
## CreditScoreMean
## Min. :609.5
## 1st Qu.:669.5
## Median :709.5
## Mean :708.9
## 3rd Qu.:729.5
## Max. :889.5
##
The histogram of loan origination dates is skewed to the left indicating more loans have been funded recently
At first glance APR values appear to have a fairly symmetric, unimodal distribution
Loan amounts are skewed to the right indicating higher loan amounts are less common.
##
## Completed FinalPaymentInProgress Current
## 19664 205 56576
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
## Chargedoff Defaulted
## 5336 1005
From the bar graph and table we see over 6,000 loans are in a default or charged off status
## [1] 0.2419306
This graph uses the newly created variable StatusCode where Loan Status values Completed & FinalPaymentinProgress were combined as were Chargedoff and Defaulted. Approximately 24% of the finished loans are in a default status.
From this graph we can see about 10,000 more borrowers own homes than those who do not.
It appears a majority of borrowers have income in the $25,000 to $75,000 range. This is a self report field so the actual earnings may be different and we will need to keep this in mind when attempting to draw conclusions.
The distribution of scores determined by Prosper are symmetrical but with the mode, or most common score being 4.
Occupation is selected by the borrower at the time they created the listing which could account for the fact that “Other” is the most common occupation.
Three years or 36 months is the most common loan term length.
Reason “Not Available” is the most common response. I wonder if this will affect the APR rate assigned to the loan and will analyze later in the bivariate analysis.
## [1] 84853
I realize this is technically a bivariate plot but I want to see if creating one variable from them is appropriate to to then create a univariate plot. This scatterplot shows the Upper and Lower Credit Scores of individuals as reported by consumer credit rating agencies. Since the pattern of these values form a line axis it indicates the reporting agencies return about the same score for individuals with similar personal financial information. Due to this it seems using the mean of these two scores is appropriate.
The graph showing the average credit score for each borrower is skewed somewhat to the right with a peak around 700.
At the time the credit profile was pulled most people had between 5 and 10 Current Credit Lines. We can again see the data is skewed to the right
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 71252 8223 2631 1039 611 310 232 167 106 79 54 43
## 12 13 14 15 16 17 18 19 20 21 22 24
## 23 26 11 12 9 9 2 1 1 5 1 2
## 27 32 51
## 2 1 1
I took the square root of the number of Delinquencies and transformed the y scale by taking the square root. The majority of borrower’s had 0 delinquencies at the time of requesting a loan. In rare cases we see 4 or more delinquencies. Surprisingly we we see a borrowers with over 20 delinquencies. It will be very interesting to see how delinquencies might relate to loan status.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.150 0.220 0.259 0.320 10.010 7296
According to the Prosper documentation the possible value is capped at 10.01. I decided to split the graph into two sections in order to zoom in on the values.
The most common debt to income ratio is between .15 and .20. I find it quite interesting that the Debt to Income ratio from 0 to 0.5 almost fits the normal model. It is quite unusual for borrowers to have a ratio above .5.
There were 113,937 loans in the dataset with 81 variables. I considered a subset of the variables(Term, LoanStatus, BorrowerAPR, ProsperRating..numeric., ProsperRating..alpha., ListingCategory..numeric.,Occupation, IsBorrowerHomeowner, CreditScoreRangeLower, CreditScoreRangeUpper, CurrentCreditLines, CurrentDelinquencies, DebtToIncomeRatio, IncomeRange, LoanOriginalAmount, LoanOriginationDate) and excluded any records that had a loan status of “Cancelled”or did not have a prosper rating. This resulted in 84,853 loans left for analysis.
In the original dataset LoanStatus, Occupation, IsBorrowerHomeowner, IncomeRange, and LoanOriginationDate were all factor variables. LoanStatus and IncomeRange were not ordered and I decided to apply an order to them which is reflected below along with the factors of the other variables.
LoanStatus - Ordered “Completed”, “FinalPaymentInProgress”, “Current”, “Past Due (1-15 days)”, “Past Due (16-30 days)”, “Past Due (31-60 days)”, “Past Due (61-90 days)”, “Past Due (91-120 days)”, “Past Due (>120 days)”, “Chargedoff”, “Defaulted”
Occupation, 68 levels - no Order
| Occupations | Occupations | Occupations |
|---|---|---|
| ’’ | Accountant/CPA | Administrative Assistant |
| Analyst | Architect | Attorney |
| Biologist | Bus Driver | Car Dealer |
| Chemist | Civil Service | Clergy |
| Clerical | Computer Programmer | Construction |
| Dentist | Doctor | Engineer - Chemical |
| Engineer - Electrical | Engineer - Mechanical | Executive |
| Fireman | Flight Attendant | Food Service |
| Food Service Management | Homemaker | Investor |
| Judge | Laborer | Landscaping |
| Medical Technician | Military Enlisted | Military Officer |
| Nurse’s Aide | Nurse (LPN) | Nurse (RN) |
| Other | Pharmacist | Pilot - Private/Commercial |
| Police Officer/Correction Officer | Postal Service | Principal |
| Professional | Professor | Psychologist |
| Realtor | Religious | Retail Management |
| Sales - Commission | Sales - Retail | Scientist |
| Skilled Labor | Social Worker | Student - College Freshman |
| Student - College Graduate Student | Student - College Junior | Student - College Senior |
| Student - College Sophomore | Student - Community College | Student - Technical School |
| Teacher | Teacher’s Aide | Tradesman - Carpenter |
| Tradesman - Electrician | Tradesman - Mechanic | Tradesman - Plumber |
| Truck Driver | Waiter/Waitress |
IsBorrowerHomewowner “False”, “True”
IncomeRange - Ordered “Not employed”, “Not displayed”, “$0”, “$1-24,999”, “$25,000-49,999”, “$50,000-74,999”, “$75,000-99,999”, “$100,000+”
LoanOriginationDate Factor variable which was not particularly useful for my analysis so I created the new variable LoanDate using this data. Please see below for additional information.
The main features in the data set are ProsperRating..numeric. and Loan Status. I hope to determine which features are best for predicting if the loan associated with a borrower’s loan status. Since Prosper provides a service to both borrowers and investors it would be useful for investors to be able to predict which loans are most likely to be completed. Prosper’s website for available loans to invest in gives the Loan Category which is character value corresponding to ProsperRating..numeric. AA or numerically 7 is the best rating with HR(i.e. 1) the worst. This field is only applicable for loans originated after July 2009. In order to see what other information is available regarding the borrower’s particulars you must register as an investor with Prosper. It will be intersting to see if the Prosper Score is truly the best predictor or if other features are better.
APR, Occupation, Listing Category, IsBorrowerHomeowner, Credit Scores, CurrentDelinquencies, DebtToIncomeRatio, and IncomeRange are potential indicators of whether or not a loan will end in default or charge off.
I used the variable LoanOriginationDate to create the variable LoanDate which is a date variable in YYYY-MM-DD format. The original variable was a factor variable and contained a timestamp as well which is not useful for my analysis.
LoanStatus was used to create the variable StatusCode, a numeric field in dataframe SPD, which I consider to be my primary response(dependent) variable. I considered Completed & FinalPaymentinProgress essentially the same status but due to timing are in different categories and assigned 0 to StatusCode. The same is true of Chargedoff and Defaulted so they were assigned status code 8.
CreditScoreMean was created by taking the average of a borrower’s CreditScoreRangeLower and CreditScoreRangeUpper. Multiple agencies may be contacted for a borrower’s credit scrore. During the Univariate section I did do a scatterplot of these two scores which I know is bivariate analysis. I thought this was the appropriate time to see what the relationship was between these two scores so that I would know which one to use during the true bivariate analysis phase when comparing Credit Score to Loan Status. Since their relationship was linear it seemed appropriate to use the average of Lower and Upper Credit Scores.
spd.sts is a dataframe which is a subset of spd and includes only records that are in a default(8) or closed(1) status. I then changed the StatusCode in this dataframe to a factor variable with two levels. This file is used in the bivariate and multivariate analysis.
I did use lubridate to create the new field Loan Date to make this feature more useful as a time stamp if not necessary for my analysis.
For the number of delinquencies to standout graphically I decided to take the square root of both the CurrentDelinquencies and y scale.
I thought it was interesting that the number of homeowners and non-homeowners is about the same. It will be interesting to see if one or the other is more likely to default on a loan.
##
## Pearson's product-moment correlation
##
## data: spd$ProsperRating..numeric. and spd$StatusCode
## t = -55.129, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1924422 -0.1794506
## sample estimates:
## cor
## -0.1859545
There is not a linear relationship between Prosper Risk Rating and the status of a loan as indicated by the correlation coefficient of -.19 as well as the plot. Let’s see if a stronger relationship between other variables exists.
## BorrowerAPR ProsperRating..numeric.
## BorrowerAPR 1.0000000 -0.96215126
## ProsperRating..numeric. -0.9621513 1.00000000
## ListingCategory..numeric. 0.1087837 -0.09447405
## CurrentCreditLines -0.1095986 0.09237706
## CurrentDelinquencies 0.1538153 -0.14520526
## DebtToIncomeRatio 0.1288220 -0.13534359
## LoanOriginalAmount -0.4263610 0.42855722
## StatusCode 0.2166196 -0.18595454
## CreditScoreMean -0.5258881 0.54887385
## ListingCategory..numeric. CurrentCreditLines
## BorrowerAPR 0.108783740 -0.10959863
## ProsperRating..numeric. -0.094474047 0.09237706
## ListingCategory..numeric. 1.000000000 -0.13339337
## CurrentCreditLines -0.133393374 1.00000000
## CurrentDelinquencies 0.062200585 -0.13152708
## DebtToIncomeRatio -0.041342845 0.14661499
## LoanOriginalAmount -0.202322201 0.19296190
## StatusCode 0.031855085 -0.06747953
## CreditScoreMean -0.007928902 0.09335291
## CurrentDelinquencies DebtToIncomeRatio
## BorrowerAPR 0.15381529 0.12882198
## ProsperRating..numeric. -0.14520526 -0.13534359
## ListingCategory..numeric. 0.06220058 -0.04134284
## CurrentCreditLines -0.13152708 0.14661499
## CurrentDelinquencies 1.00000000 -0.03839116
## DebtToIncomeRatio -0.03839116 1.00000000
## LoanOriginalAmount -0.11111509 -0.01783746
## StatusCode 0.05701356 0.04589815
## CreditScoreMean -0.16013458 -0.01370880
## LoanOriginalAmount StatusCode CreditScoreMean
## BorrowerAPR -0.42636102 0.21661957 -0.525888129
## ProsperRating..numeric. 0.42855722 -0.18595454 0.548873850
## ListingCategory..numeric. -0.20232220 0.03185508 -0.007928902
## CurrentCreditLines 0.19296190 -0.06747953 0.093352909
## CurrentDelinquencies -0.11111509 0.05701356 -0.160134579
## DebtToIncomeRatio -0.01783746 0.04589815 -0.013708795
## LoanOriginalAmount 1.00000000 -0.07535678 0.277918466
## StatusCode -0.07535678 1.00000000 -0.089583397
## CreditScoreMean 0.27791847 -0.08958340 1.000000000
The correlation between the loan risk rating assigned by Prosper and the offered APR is quite strong with r = -.96. This is not surprising since one would expect borrowers with good(high ratings) credit history to be offered a lower APR than one who presents as bad risk(low rating) Additionally it appears that the APR in turn has a slightly stronger relationship r = .22 with loan status code than the Prosper rating. I wonder how allowing for the categorical variables affects the relationship.
Proper Risk Rating and Credit Score Mean are only moderately correlated with r = .55. This indicates Prosper does not use only Credit Scores to determine the risk rating of a loan
I want to look closer at scatter plots involving Status Code as it relates to Debt/Income, Number of Credit Lines, Reason, Number of Delinquencies, and Mean Credit Score. I don’t believe we will see much of a linear relationship but it will let me get quick idea of where values are clustered.
## StatusCode
## 0 1 2 3 4 5 6 7 8
## 0 3 2 0 0 0 0 0 0 1
## 0.01 31 12 0 0 0 1 0 0 2
## 0.02 115 63 1 0 0 1 0 0 23
## 0.03 187 153 2 0 3 3 0 1 53
## 0.04 235 210 4 1 3 1 1 0 54
## 0.05 301 340 11 3 4 4 5 0 70
## 0.06 351 484 13 2 6 3 3 0 75
## 0.07 410 633 6 6 3 4 2 0 93
## 0.08 525 866 17 2 10 5 4 0 134
## 0.09 403 756 19 4 5 8 4 1 86
## 0.1 508 975 12 5 4 5 3 0 117
## 0.11 547 1240 25 3 8 11 10 0 127
## 0.12 586 1258 14 9 7 5 9 0 145
## 0.13 651 1489 12 7 6 8 11 0 154
## 0.14 778 1672 30 6 12 11 5 0 189
## 0.15 608 1617 16 8 9 5 11 0 159
## 0.16 659 1688 24 7 11 4 11 1 162
## 0.17 710 1789 27 10 15 7 7 1 183
## 0.18 835 2066 29 7 13 20 12 0 231
## 0.19 518 1740 18 2 17 11 11 0 127
## 0.2 566 1789 14 5 15 10 7 0 143
## 0.21 551 1794 28 6 10 10 6 1 172
## 0.22 713 2003 24 13 10 9 13 2 212
## 0.23 438 1655 17 4 7 2 2 1 116
## 0.24 479 1656 24 6 5 2 3 0 122
## 0.25 435 1649 18 8 7 2 8 0 107
## 0.26 392 1627 22 9 6 5 6 0 112
## 0.27 389 1566 20 2 11 7 6 0 123
## 0.28 397 1515 17 6 11 6 3 0 119
## 0.29 357 1341 19 5 8 10 7 0 115
## 0.3 329 1353 22 2 6 6 3 0 100
## 0.31 326 1275 6 3 4 3 4 0 116
## 0.32 300 1186 14 7 8 6 6 1 110
## 0.33 275 1104 19 4 7 2 4 0 98
## 0.34 258 1042 15 5 4 6 5 1 91
## 0.35 296 1031 9 13 4 4 8 1 97
## 0.36 188 908 7 6 11 2 4 0 65
## 0.37 189 786 8 3 3 7 5 0 77
## 0.38 178 777 11 2 3 3 5 0 61
## 0.39 163 701 14 2 2 6 8 1 42
## 0.4 152 623 13 3 7 2 2 0 67
## 0.41 108 633 5 6 1 1 3 0 46
## 0.42 117 506 8 1 4 2 4 0 62
## 0.43 98 500 5 5 1 3 2 0 43
## 0.44 106 428 4 1 3 2 3 0 41
## 0.45 94 383 12 3 4 5 3 0 41
## 0.46 83 347 7 2 5 2 2 0 42
## 0.47 72 358 4 3 2 1 2 0 41
## 0.48 52 296 3 1 1 1 2 0 35
## 0.49 65 249 8 0 3 2 0 0 23
## 0.5 51 241 7 0 0 2 4 0 33
Between .05 and .35 we see an increase in the number of defaults and late payments.
This doesn’t give me much information about how the number of Current Delinquencies at the time the profile was pulled affects default status. I want to zoom in to see if their is a number range that might indicate a borrower will default.
## StatusCode
## CurrentDelinquencies 0 1 2 3 4 5 6 7 8
## 0 16802 48033 619 198 281 230 237 11 4841
## 1 1824 5295 109 36 50 42 44 4 819
## 2 614 1652 42 11 13 13 10 0 276
## 3 236 615 13 10 5 11 4 1 144
## 4 134 355 4 3 4 5 2 0 104
## 5 60 188 4 3 4 4 3 0 44
## 6 63 121 7 0 0 4 1 0 36
## 7 40 97 2 1 1 0 1 0 25
## 8 28 60 2 1 0 1 1 0 13
## 9 18 45 1 1 2 0 1 0 11
## 10 10 35 0 0 1 0 0 0 8
From the graph and table at and above 0 Delinquencies the difference between the number of Completed vs Defaulted loans begins to decrease.
## CurrentCreditLines
## StatusCode 0 1 2 3 4 5 6 7 8 9 10 11
## 0 55 254 500 824 1105 1383 1561 1662 1784 1620 1451 1399
## 1 74 245 630 1114 1768 2992 3652 4867 4993 4973 4765 4407
## 2 3 15 22 35 38 49 47 73 75 78 60 48
## 3 1 1 6 9 12 15 19 25 16 19 18 28
## 4 1 4 15 13 23 20 28 38 26 32 31 18
## 5 1 3 8 21 18 29 28 29 14 21 25 22
## 6 4 4 6 7 14 24 32 31 16 27 28 13
## 7 0 1 2 1 0 1 1 1 1 2 1 1
## 8 80 196 295 352 444 443 518 520 474 468 410 385
## CurrentCreditLines
## StatusCode 12 13 14 15 16 17 18 19 20 21 22 23
## 0 1150 994 841 668 593 440 343 261 221 155 133 106
## 1 3816 3304 2862 2356 2038 1515 1348 1073 809 643 486 380
## 2 50 36 37 21 31 22 14 10 10 6 5 4
## 3 16 15 17 8 9 3 3 4 5 5 0 1
## 4 23 15 17 14 10 11 6 1 6 3 2 1
## 5 15 13 11 10 11 6 7 2 5 2 0 6
## 6 20 12 14 16 6 6 5 4 4 2 1 2
## 7 2 1 0 0 0 0 0 0 0 0 0 0
## 8 326 291 241 185 167 124 109 73 57 46 27 26
Now this is quite interesting. For borrower’s with 0 & 1 credit lines a similar number of defaults compared to complete loans. And then as the number of credit lines increases a great deal more completed loans than defaulted. So it seems having existing credit lines is a slight indication that the loan will be completed rather than defaulted.
For borrower’s with credit scores near 600 about the same number of borrower’s defaulting vs borrower’s completing their loans while for those above 600 more do complete their loan but quite a few people with scores between 650 and 750 default.
## [1] "Reason x Status Code table below"
##
## 0 1 2 3 4 5 6 7 8
## 0 0.35 0.60 0.00 0.00 0.00 0.00 0.05 0.00 0.00
## 1 0.19 0.74 0.01 0.00 0.00 0.00 0.00 0.00 0.06
## 2 0.30 0.58 0.01 0.00 0.00 0.01 0.00 0.00 0.09
## 3 0.32 0.52 0.01 0.00 0.01 0.01 0.01 0.00 0.13
## 5 0.84 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.16
## 6 0.43 0.44 0.01 0.00 0.00 0.00 0.00 0.00 0.11
## 7 0.40 0.46 0.01 0.00 0.00 0.00 0.00 0.00 0.12
## 8 0.14 0.77 0.01 0.02 0.00 0.00 0.01 0.00 0.07
## 9 0.29 0.65 0.01 0.00 0.00 0.00 0.01 0.00 0.04
## 10 0.37 0.48 0.02 0.00 0.00 0.00 0.01 0.00 0.11
## 11 0.27 0.67 0.00 0.01 0.01 0.00 0.00 0.00 0.03
## 12 0.19 0.61 0.00 0.02 0.02 0.00 0.02 0.00 0.15
## 13 0.19 0.66 0.01 0.00 0.01 0.00 0.01 0.00 0.11
## 14 0.16 0.74 0.02 0.01 0.01 0.01 0.00 0.00 0.05
## 15 0.16 0.71 0.02 0.01 0.01 0.00 0.01 0.00 0.09
## 16 0.29 0.66 0.01 0.00 0.01 0.00 0.00 0.00 0.03
## 17 0.31 0.65 0.00 0.00 0.02 0.00 0.00 0.00 0.02
## 18 0.18 0.72 0.01 0.01 0.01 0.00 0.00 0.00 0.06
## 19 0.20 0.71 0.01 0.00 0.00 0.00 0.01 0.00 0.07
## 20 0.17 0.75 0.01 0.00 0.01 0.00 0.01 0.00 0.06
##
## 0 1 2 3 4 5 6 7 8
## 0 7 12 0 0 0 0 1 0 0
## 1 9868 39194 457 138 196 172 161 7 2987
## 2 2016 3963 81 30 31 39 27 2 612
## 3 1679 2738 56 21 33 29 40 1 701
## 5 231 0 0 0 0 0 0 0 43
## 6 964 975 18 8 7 6 10 0 249
## 7 3645 4218 86 26 34 38 25 3 1143
## 8 27 153 2 3 0 0 1 0 13
## 9 25 55 1 0 0 0 1 0 3
## 10 34 44 2 0 0 0 1 0 10
## 11 59 145 1 2 2 1 1 0 6
## 12 11 36 0 1 1 0 1 0 9
## 13 377 1321 26 8 17 9 10 2 226
## 14 141 652 14 5 7 7 4 0 46
## 15 240 1078 28 11 17 7 10 0 131
## 16 87 201 2 0 4 1 0 1 8
## 17 16 34 0 0 1 0 0 0 1
## 18 163 639 13 6 6 2 2 0 54
## 19 151 543 10 3 2 2 4 0 53
## 20 128 575 9 3 5 0 5 0 46
The most common reason code is 1(Loan Consolidation) yet uthas a lower percentage of defaulted loans than reason code 5 which has 16% of it’s loans in a default status. I’m probably not going to pursue whether or not Listing Category affects loan status any further as this self select category and the borrower may not be giving the true purpose of the loan.
## Min. 1st Qu. Median Mean 3rd Qu.
## "2009-07-20" "2012-02-23" "2013-04-09" "2012-11-15" "2013-11-05"
## Max.
## "2014-03-12"
Loans with an origination date of 2011 and later could still have been in progress as of 2014(as indicated by loan status 1-7) so the final outcome of the loan(i.e. Completed or Defaulted) is unknown. I am not going to do further analysis on year and Loan Status I just wanted to see the yearly trend for completion vs. default.
## [1] "Correlation between Status Code and Mean Debt"
## [1] 0.4123365
Status Code vs. DebtToIncomeRatio Mean graph shows reasonable scatter from which a linear model could be created & positive correlation of .41 indicating a slight positive linear relationship
## [1] "Correlation betweeen Status Code and APR mean"
## [1] 0.8554174
## [1] "Mean Borrower's APR"
## [1] 0.2266582
Since our univariate analysis showed that the Borrower’s APR was fairly symmetrical with a single peak near the center of the data it is realistic to use the mean of .23 as the center of the distribution. That combined with the fact that Status Code and the APR’s mean have a strong correlation with r = .86 it appears that, for this data, the higher the borrower’s APR is above the mean of .23 the more likely the borrower had late payments or defaulted.
## [1] "Corrrelation betweeen Status Code and Prosper Rating Mean"
## [1] -0.8436804
## [1] "Mean Prosper Rating"
## [1] 4.072243
The relationship between Prosper Rating mean vs Status code is strong and negative (-.84). In other words the lower the prosper rating the more likely borrower’s had loans that were in a late or defaulted status.
## [1] "Corrrelation betweeen Status Code and Credit Score Mean"
## [1] -0.6143982
## [1] "Mean Credit Score"
## [1] 708.8902
The relationship between Credit Score Mean vs Status Code is moderately strong and negative with r = -.61. Credit Score used in conjunction with the Prosper Rating could help an investor select loans most likely to end up closed rather than defaulted. Most points are clustered around 698.
dim(spd.sts)
## [1] 26210 19
table(spd.sts$StatusCode)
##
## Closed Defaulted
## 19869 6341
For the rest of the analysis I’m only interested in Status Code 0 & 8. What will happen with Statuscodes 1-7 is unknown as they are in progress. This means I am now treating Status Code as a qualitative variable. Reducing the data set gives me 26,210 records for analysis. 6,341 of theses are in a default status.
The above two box plots examine APR and Prosper Rating for closed and defaulted loans.
The borrower’s APR 50th percentile is approximately 25% for completed loans while it is at about 30% for defaulted loans. Note the triangle on the graphs represents the mean of each. Since these medians and means are different this could be an indication that knowing a borrower’s APR in addition to Prosper Rating could be useful.
Similarly the Prosper Rating’s 50th percentile is four for completed loans while for defaulted loans it is three and 75% of borrower’s who default on a loan have a Prosper Rating of four or less.
## CurrentDelinquencies
## StatusCode 0 1 2 3 4 5 6 7 8 9 10
## Closed 0.78 0.69 0.69 0.62 0.56 0.58 0.64 0.62 0.68 0.62 0.56
## Defaulted 0.22 0.31 0.31 0.38 0.44 0.42 0.36 0.38 0.32 0.38 0.44
The distributions of Current Delinquencies are quite similar for completed and defaulted loans. They both share the same 75th percentile of 0. Outliers exist at 1 & up so I decided to focus on them. Borrower’s with one or more delinquencies have a greater percent of defaults compared to those with 0 delinquencies. For example for borrower’s with just 1 delinquency the default rate is about 31% but for those with 0 delienquencies the default rate is about 22%.
Outliers with values between .5 and 10 exist for both closed and defaulted loans but don’t occur with high frequency. I am zooming in on debt to income ratios between 0 and .5 For loans in a closed status 75% of borrowers had a Debt to Income Ratio of .29 or less while for loans in a Defaulted Status 75% of borrowers had a Debt to Income Ratio of .34. The difference between medians of the two status is only 5% but it does appear that having a lower debt to Income Ratio may be related to Status.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 709.5 715.7 749.5 889.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 689.5 697.1 729.5 869.5
75% of borrowers with closed loans had a mean credit score of 750 or less while for defaulted loans the 75th percentile is about 730.
While a difference of 20 points does not seems important perhaps it is large enough to indicate a relationship between Credit Score Mean and Loan Status.
Current Credit Lines, Reason(ie. Listing Category), and Original Loan Amount have similar distributions for both completed and defaulted loans so it does not appear that their is a relationship between them and default status.
Whether or not someone is a homeowner does not seem to impact whether or not they will default on a loan. Approximately the same number of homeowners and non homeowners have loans in closed status and while non homeowners may have defaulted on loans in greater number it is not by that much as depicted in the graph.
Again this is a self selected category and no Occupation stands out more than another as having more or less defaults compared to closed. It doesn’t appear a relationship exists between Status Code and Occupation. # Bivariate Analysis
slight differences in Debt to Income Ratio, Credit Score, and number of delinquencies for closed vs. defaulted loans exist. I was hoping to find a more dramatic relationship but nothing is standing out yet. Perhaps a variable will stand out more in the multivariate analysis.
During this analysis it became clear that I could not treat Loan Status as a quantitative variable and decided to just focus on closed vs defaulted status. What is most surprising is the seemingly lack of a strong relationship. For example the distribution between home owners/non owners and whether a person defaulted on a loan is about the same. I started this project believing homeowers would be more likely to complete loans than non-homeowners which was an incorrect assumption.
The strongest linear relationship is between Prosper Rating Mean vs. Status Code but a linear model here just doesn’t seem appropriate. Treating Status code as a qualitative(Closed vs Default) variable I am seeing a strong relationship between Status Code vs Borrower’s APR as indicated by the box plots and scatterplot of Status code vs. Borrower’s APR mean.
For this section I have decided to just work with the dataframe containing status code 0 (Loan in Closed Status or pending closure) and status code 8 (Defaulted). Also please note that numeric Prosper Ratings range from 1(Higher Risk - “HR”) to 7(Lower Risk - “AA”)
The Prosper Rating density graph makes it very clear that investing in a loan that has been rated below 4 is a risky investment but I want to see what else might help us predict the likelihood of a borrower defaulting.
For borrowers with zero Delinquencies, the percent of closed loans is greater than defaulted but as soon as 1 or more delinquencies are found, loans in default status are almost equal to those in closed.
Borrowers with 0 to 5 credit lines actually have more loans in a default status and those with 5 to about 18 have more loans in a close status but these densities are similar and have significant overlap.
The distributions of Debt to Income Ratio’s for Closed and Defaulted loans also have very similar shapes. Borrower’s with a Debt To Income Ratio between approximately 5% and 25% have a greater number of loans in closed status and then once the Debt To Income Ratio exceeds 25% the occurrence of defaulted loans is greater than closed loans.
An APR of 25% seems to be a tipping point for the likelihood of a loan being in default status. Below 25% more loans are closed than defaulted. Above 25% and defaults are more common.
The last density indicates borrowers with a credit score less than approximately 710 had a greater percentage of loans in a default status
## Rating StatusCode Min Q1 Median Mean Q3 Max
## 1 AA Closed 1 7 9 9.871 12.0 37
## 2 AA Defaulted 1 7 10 10.080 12.5 25
## 3 A Closed 0 7 9 10.180 13.0 39
## 4 A Defaulted 1 6 9 9.849 12.0 30
## 5 B Closed 1 6 9 9.784 12.0 48
## 6 B Defaulted 1 5 8 8.981 12.0 30
## 7 C Closed 0 6 9 10.210 13.0 34
## 8 C Defaulted 0 6 9 9.238 13.0 28
## 9 D Closed 0 6 9 9.417 12.0 59
## 10 D Defaulted 0 5 8 8.603 11.0 35
## 11 E Closed 0 6 9 9.731 13.0 44
## 12 E Defaulted 0 5 8 9.071 12.0 41
## 13 HR Closed 0 5 8 9.411 12.0 35
## 14 HR Defaulted 0 4 8 8.545 12.0 39
Since the above distributions are so similar knowing the number of current credit lines on top of Prosper Rating does not give us any more information about whether or not the borrower will have a loan in a default status
## Rating StatusCode Min Q1 Median Mean Q3 Max
## 1 AA Closed 0.01 0.08 0.12 0.1453 0.18 10.01
## 2 AA Defaulted 0.03 0.12 0.17 0.1940 0.24 0.51
## 3 A Closed 0.01 0.12 0.17 0.1957 0.24 5.64
## 4 A Defaulted 0.02 0.13 0.18 0.2263 0.27 2.79
## 5 B Closed 0.00 0.13 0.18 0.2094 0.26 10.01
## 6 B Defaulted 0.02 0.13 0.18 0.2125 0.27 1.32
## 7 C Closed 0.01 0.15 0.22 0.2466 0.31 5.56
## 8 C Defaulted 0.00 0.15 0.23 0.3093 0.34 10.01
## 9 D Closed 0.01 0.14 0.21 0.2421 0.31 10.01
## 10 D Defaulted 0.01 0.15 0.22 0.2870 0.33 10.01
## 11 E Closed 0.00 0.16 0.24 0.2699 0.33 4.89
## 12 E Defaulted 0.02 0.16 0.26 0.3353 0.38 10.01
## 13 HR Closed 0.00 0.16 0.24 0.3271 0.34 10.01
## 14 HR Defaulted 0.01 0.16 0.25 0.3469 0.38 10.01
## DebtToIncomeRatio.NA's
## 1 65
## 2 6
## 3 186
## 4 30
## 5 158
## 6 47
## 7 260
## 8 117
## 9 500
## 10 276
## 11 388
## 12 234
## 13 420
## 14 301
Debt To Income ratio gives a bit more information. It appears not all loans given “AA” ratings are equivalent. The medians differ by 5% and the 3rd quartiles by 6%. With the exception of rating “AA” the medians are quite similar for “Closed” & “Defaulted” in each of the Prosper Ratings. A bit more difference exists between the 3rd Quartiles(75th Percentile) for “EE”, and “HR”. This indicates to me that knowing the DebtToIncome ratio of borrower and the median Debt to Income ratio of all closed loans for each Rating would be very useful.
## Rating StatusCode Min Q1 Median Mean Q3 Max
## 1 AA Closed 0 0 0 0.03736 0 10
## 2 AA Defaulted 0 0 0 0.14460 0 6
## 3 A Closed 0 0 0 0.13100 0 21
## 4 A Defaulted 0 0 0 0.21230 0 12
## 5 B Closed 0 0 0 0.23020 0 15
## 6 B Defaulted 0 0 0 0.33840 0 9
## 7 C Closed 0 0 0 0.27990 0 21
## 8 C Defaulted 0 0 0 0.37620 0 13
## 9 D Closed 0 0 0 0.38550 0 21
## 10 D Defaulted 0 0 0 0.48060 0 14
## 11 E Closed 0 0 0 0.50320 0 21
## 12 E Defaulted 0 0 0 0.85880 1 32
## 13 HR Closed 0 0 0 0.60770 1 22
## 14 HR Defaulted 0 0 0 0.59130 1 15
Because number of delinquencies are clustered around 0 & 1 I wanted to see a barchart in addition to the boxplots. With the exception of the “HR” rating, borrower’s with 0 delinquencies have more closed loans than defaulted across ratings. As soon as 1 delinquency is found the percentage in a default status exceeds those in closed. Just having access to whether or not a borrower has any delinquencies exist may be useful.
## Rating StatusCode Min Q1 Median Mean Q3 Max
## 1 AA Closed 689.5 769.5 809.5 800.9 829.5 889.5
## 2 AA Defaulted 729.5 769.5 789.5 790.0 809.5 869.5
## 3 A Closed 649.5 729.5 749.5 753.4 789.5 889.5
## 4 A Defaulted 649.5 729.5 749.5 750.4 789.5 849.5
## 5 B Closed 609.5 689.5 729.5 726.4 749.5 869.5
## 6 B Defaulted 609.5 689.5 729.5 722.0 749.5 829.5
## 7 C Closed 609.5 669.5 689.5 706.1 729.5 869.5
## 8 C Defaulted 629.5 669.5 709.5 709.3 729.5 869.5
## 9 D Closed 609.5 669.5 689.5 694.9 729.5 869.5
## 10 D Defaulted 609.5 649.5 689.5 694.0 729.5 829.5
## 11 E Closed 609.5 649.5 669.5 673.8 689.5 869.5
## 12 E Defaulted 609.5 649.5 669.5 672.2 689.5 869.5
## 13 HR Closed 609.5 649.5 689.5 683.1 709.5 869.5
## 14 HR Defaulted 609.5 669.5 689.5 685.8 709.5 809.5
This one shows for “AA” ratings median score differs by about 20 points with closed loans being higher, yet for “C” Prosper Ratings the median Credit score is actually higher for loans in a default status. For the other Prosper Ratings the distributions for closed & defaulted distributions are fairly similar.
## Rating StatusCode Min Q1 Median Mean Q3 Max
## 1 AA Closed 0.04583 0.07339 0.08341 0.08569 0.09643 0.1936
## 2 AA Defaulted 0.06327 0.08466 0.09736 0.10310 0.11670 0.2137
## 3 A Closed 0.07045 0.11770 0.12780 0.13270 0.14470 0.2481
## 4 A Defaulted 0.10080 0.12400 0.13520 0.14160 0.15940 0.2224
## 5 B Closed 0.08999 0.17160 0.18190 0.18530 0.20200 0.2731
## 6 B Defaulted 0.13110 0.17360 0.19650 0.19570 0.20490 0.3745
## 7 C Closed 0.11160 0.22280 0.23510 0.23550 0.24970 0.3915
## 8 C Defaulted 0.12720 0.23250 0.24760 0.24760 0.26890 0.3745
## 9 D Closed 0.14060 0.27770 0.29260 0.29010 0.30530 0.3915
## 10 D Defaulted 0.19690 0.28700 0.29510 0.29540 0.30530 0.3872
## 11 E Closed 0.17430 0.33040 0.35090 0.34290 0.35650 0.4068
## 12 E Defaulted 0.23120 0.33970 0.35240 0.34860 0.35640 0.4068
## 13 HR Closed 0.20260 0.35640 0.35800 0.35610 0.35800 0.4136
## 14 HR Defaulted 0.20500 0.35640 0.35800 0.35860 0.35800 0.4240
The plot that stands out the most is for “AA”. A greater percentage of loans are in a default status when the APR rate is higher, even by just a percentage point. The median for those in a default status is about 9.7% while for those in closed is about 8.3%. The 3rd quartiles differ by 2% Having access to the borrower’s APR and the median APR rate for closed loans as a reference point would be useful for investors.
Once again it does not appear that being a homeowner influences status of the loan.
Certainly taking into account the Prosper Rating in conjunction with the loan status of closed or defaulted helped to better identify variables which may help predict whether or not a borrower will default on a given loan. Besides the risk rating assigned by Prosper also knowing a borrower’s Mean Credit Score, Debt To Income Ratio, and whether or not the borrower has any delinquencies reported would be helpful to investor’s when trying to determine which loans they wish to fund in order to gain the most profit. Of course for these values to be useful investor’s would also need the corresponding information for all closed loans to use as a reference point.
The density graph of number of delinquencies hilited that even if a person has only one delinquency it can affect the ultimate loan status. I was surprised to see when faceting by Prosper risk rating that their rating system seems to reflect this. While prosper rating does seem fairly accurate knowing a bit more of the details and paying attention to seemingly small differences in financial information an investor may increase their chances of selecting a loan that is more likely to result in closure.
I did not create any models.
While Prosper Risk Rating and Loan status are not linearly correlated the number of observations decreases overall as the rating increases from higher risk(0) to lower risk(7). Since lower risk ratings with defaults and late payments exist I tried to identify other features that may help predict whether or not a loan will be defaulted on.
The rating for which knowing the borrower’s Credit Score seems most critical is “AA”, Prosper’s highest rating. For “AA” closed loans the 75th percentile is about 20 points higher than for defaulted loans. The same is true of the medians. This tells me that an investor selecting a “AA” rating should also select a loan whose borrower has a credit score of 810 or higher to help increase the chance of a closed loan. The other rating with a strange anomaly is “C”. While the 75th percentiles are about the same the median of closed loans is actually about 20 points less than for defaulted loans.
About 95% of “AA” rated loans result in a closed status if a borrower has zero delinquencies and if they have one delinquency about 85% of loans result in closed. Interestingly if they had three or four, 100% of the loans were closed. This indicates to me that Prosper’s calculation for assigning their rating does a good job taking into account delinquencies.
Prosper provides a risk rating on a scale from HR(higher risk, 1) to AA(lower risk, 7) to help investors select an investment. I wanted to see if there were other financial indicators associated with a borrower that influenced loan status. As I progressed through the analysis it became apparent that treating loan status as a quantitative variable was not appropriate and I changed my focus to examining which variables might have the biggest affect on a loan be defaulted on vs. closed.
I was surprised that overall the Prosper Rating by itself is a very reasonable predictor and had to look carefully for subtleties that in conjunction with the rating might help predict the completion or default of a loan. At first I thought a borrower having any delinquencies was a definite indicator but as I looked more carefully at the data and took a subset of the data to only include records with a prosper rating which was added in 2009, I saw that the risk rating does seem to take delinquencies into account in the formula for calculating AA. My analysis indicates knowing the borrower’s APR and Credit Score and associated medians for closed AA ratings could be useful in selecting a loan that has a greater chance of being closed. For all ratings knowing the borrower’s debt to income ratio could also help an investor select a loan that is more likely to be closed.
Future work could include actually doing analysis of variance(ANOVA) to help determine which of the predictor variables, Debt to income ratio, APR, or Credit Score has the greatest influence on loan default. For example the proportion of borrowers who default given the loan has an APR of .23 or less to the proportion of borrowers who default given the loan has an APR greater than .23 could be compared.